Applied Data Science Capstone by IBM/Coursera

Week 2

Introduction: Business Problem

In this project we will try to find optimal locations for our investments. Specifically, this report will be targeted to stakeholders interested in opening shops in Niedersachsen (Lower Saxony), Germany. Niedersachsen is a Land (German state) in nortwestern Germany. It is the second-largest state by land area, and fourth-largest in population (7.9 million) among the 16 Länder federated as the Federal Republic of Germany.

Since there are lots of informations about big cities we will try to detect locations that are not already crowded with shops. We are also particularly interested in rural areas. We would also prefer locations with middle level population.

We will use our data science powers to generate more information about promissing cities based on this criteria. Advantages of each area will then be clearly expressed so that best possible locations can be chosen by stakeholders.

Data

Based on definition of our problem, factors that will influence our decission are:

  • number and types of existing shops in the city (any type of shop)
  • number of and distance to big cities in the vicinity, if any
  • distance of area from city centers

We decided to use locations, centered around city center, to define our area.

Following data sources will be needed to extract/generate the required information:

  • Approximate addresses of centers of those areas will be obtained using geopy.geocoders
  • Number of shops and their type and location in every city will be obtained using Foursquare API
  • List of cities and their populations will be obtained using www.citypopulation.de of Niedersachsen state

Import necessary Libraries

In [623]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot') # optional: for ggplot-like style

Define Foursquare Credentials and Version

In [44]:
CLIENT_ID = 'JUZASREC0SKEE14BSG0C44LBR4ROZIHIYT0JBGZ5XG5NVSBM' # your Foursquare ID
CLIENT_SECRET = 'ZETXHBITZVSLOCK2KYFONDFFUW5WQYGSYDRPZGDWOKHLG1MX' # your Foursquare Secret
VERSION = '20200807'
LIMIT = 100
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
Your credentails:
CLIENT_ID: JUZASREC0SKEE14BSG0C44LBR4ROZIHIYT0JBGZ5XG5NVSBM
CLIENT_SECRET:ZETXHBITZVSLOCK2KYFONDFFUW5WQYGSYDRPZGDWOKHLG1MX

Let's see the map of Niedersachsen

Red indicate Areas with High-Population

In [475]:
from IPython.display import Image
Image(filename='Niedersachsen.jpg') 
Out[475]:

This map shows the population density in Niedersachsen. The population density is related to the diversity of the shops in that area. The information provided on this map corresponds to the information we have acquired ourselves. It will be appropriate to be seen for control purposes.

All Cities in Niedersachsen

Let's find list and populations for centroids of all cities.

This Data is extracted from "https://www.citypopulation.de/"

In [476]:
Niedersachsen_population=pd.read_html("https://www.citypopulation.de/en/germany/niedersachsen/")[1]
NSachsen_Pop=pd.read_csv("Nieder_Cities without Location.csv") #I had to delete some cities manually.
NSachsen_Pop = NSachsen_Pop.rename(columns={'PopulationEstimate2019-12-31': 'Population'})
NSachsen_Pop=NSachsen_Pop.drop(columns=["Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1", "Unnamed: 0.1.1.1","PopulationEstimate1995-12-31","PopulationEstimate2001-12-31","PopulationCensus2011-05-09","Unnamed: 7"])
NSachsen_Pop.head()
print(NSachsen_Pop.shape)
(839, 4)

839 cities have been identified in the state of Niedersachsen. City refers to all kinds of settlements. Some of these cities are the size of large cities. However, Niedersachsen is generally a low population density state. This is why Niedersachsen usually has small settlements.

Population Histogram : All Cities

In [462]:
count, bin_edges = np.histogram(NSachsen_Pop['Population'])
NSachsen_Pop['Population'].plot(kind='hist', figsize=(16, 5))
plt.title('Niedersachsen All Cities') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Population') # add x-label
plt.show()

In this histogram, it can be seen that most of the cities are small. In other words, the population of the cities is usually less than 50000.

Latitude & Longitude Coordinates for All Cities

In [42]:
latitude_niedersachsen=[]
longitude_niedersachsen=[]

for i in NSachsen_Pop['Name']:
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(i)
    latitude_niedersachsen.append(location.latitude)
    longitude_niedersachsen.append(location.longitude)

NSachsen_Pop['Lat']=latitude_niedersachsen
NSachsen_Pop['Long']=longitude_niedersachsen     
In [465]:
NSachsen=NSachsen_Pop[['Name','Lat','Long','County','Population']]
print(NSachsen.shape)
NSachsen.head()
(839, 5)
Out[465]:
Name Lat Long County Population
0 Achim 53.006881 9.042550 Verden 31923.0
1 Adelebsen 51.579484 9.752448 Göttingen 6245.0
2 Adelheidsdorf 52.565616 10.057166 Celle 2641.0
3 Adenbüttel 52.381230 10.451016 Gifhorn 1787.0
4 Adendorf 53.281748 10.439299 Lüneburg 10853.0

Location information for all cities was obtained through geopy.geocoders. Information about shops in all cities will be examined. We will direct our study based on this information.

Map (all cities)

In [328]:
def toSimpleMap (Map, latitudes, longitudes, LocationNames):

    

    for lat,lng,city in zip(latitudes, longitudes, LocationNames):
        label = '{}'.format(city)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Map)    

        
#                     ***to use, example***        
#Map = folium.Map(location=[53.007, 9.043], zoom_start=7)
#toSimpleMap (Map, NSachsen_Pop['Lat'], NSachsen_Pop['Long'],NSachsen_Pop['Name'])
#Map     
In [62]:
Map = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (Map, NSachsen['Lat'], NSachsen['Long'],NSachsen['Name'])
Map
Out[62]:
Make this Notebook Trusted to load map: File -> Trust Notebook

All cities are mapped. This includes very small settlements. So we can get an overview of the state of Niedersachsen. However, when we examine the map closely, it is seen that the city centers are not located at the center of the residential areas. This is especially true for large populated areas. The information we will obtain through Foursquare is based on coordinates. Information about the shops around these coordinates can be obtained. When these coordinates are at the wrong point of the cities, we cannot get the correct information. Therefore, in our study, not large, but medium-sized cities will be examined. These kinds of cities are already planned to be made in investments.

Map (Big Cities)

In [336]:
NSachsen_High=NSachsen[(NSachsen['Population']>50000)]
Map_big = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (Map_big, NSachsen_High['Lat'], NSachsen_High['Long'],NSachsen_High['Name'])
Map_big
Out[336]:
Make this Notebook Trusted to load map: File -> Trust Notebook

In this map we see the major cities in Niedersachsen. The locations of the cities on this map correspond to the population density map. It is a useful map in terms of controlling the location information we have acquired ourselves.

Middle Cities in Niedersachsen

Let's find list and populations for middle cities.

In [466]:
NSachsen_Medium=NSachsen[(NSachsen['Population']<20000)&(NSachsen['Population']>5000)]
print(NSachsen_Medium.shape)
NSachsen_Medium.head()
(217, 5)
Out[466]:
Name Lat Long County Population
1 Adelebsen 51.579484 9.752448 Göttingen 6245.0
4 Adendorf 53.281748 10.439299 Lüneburg 10853.0
5 Aerzen 52.049607 9.263816 Hameln-Pyrmont 10524.0
10 Ahlerstedt 53.406983 9.452321 Stade 5451.0
13 Alfeld (Leine) 51.986308 9.824747 Hildesheim 18535.0

217 medium-sized cities of Niedersachsen have been identified. These cities have a population of between 5,000 and 20,000. We already have the location and population information of these cities.

Population Histogram : Middle Cities

It is considered that the histogram will be useful to better understand the population information of these cities with medium-sized populations.

In [467]:
count, bin_edges = np.histogram(NSachsen_Medium['Population'])
print(count) # frequency count 
print(bin_edges) # bin ranges, default = 10 bins
[43 35 31 24 22 25  9 11  7 10]
[ 5085.   6560.6  8036.2  9511.8 10987.4 12463.  13938.6 15414.2 16889.8
 18365.4 19841. ]
In [468]:
NSachsen_Medium['Population'].plot(kind='hist', figsize=(16, 5))
plt.title('Niedersachsen Medium Cities') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Population') # add x-label
plt.show()

As can be seen, the cities we are studying have usually a population of between 5000 and 15000. For example, about 150 cities have a population between 5000 and 15000. The remaining 70 cities have a population of between 15000 and 20000. However, cities with a population of between 15000 and 20000 still have a significant number. In this study, all of these cities with medium-sized population will be examined. Thus, we have used our Niedersachsen demographic information in our study.

Map (Middle Cities)

Medium sized cities have been mapped.

In [110]:
Map = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (Map, NSachsen_Medium['Lat'], NSachsen_Medium['Long'],NSachsen_Medium['Name'])
Map
Out[110]:
Make this Notebook Trusted to load map: File -> Trust Notebook

It is clear from this map that we can get an idea of the whole state of Niedersachsen by examining the medium-sized cities. As it seems, almost the entire state of Niedersachsen is covered. However, it is striking that the surroundings of large cities appear empty on the map. Because there are densely populated settlements around big cities.




Let's create a function to find the Venues (shops etc.)

In [324]:
def getNearbyVenues(names, latitudes, longitudes, populations, radius=3000):
    LIMIT=100
    venues_list=[]
    for name, lat, lng, pop in zip(names, latitudes, longitudes, populations):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,
            pop,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude',
                  'City Population',           
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

The Venues will be extracted

Now write the code to run the above function on each Middle-City and create a new dataframe called NSachsen_Medium_venues.

In this study, shops, namely venues, in cities with medium-sized population are examined. It is desired to reach the information about what kind of businesses are there in these cities. We will access this information via Foursquare.

In [115]:
NSachsen_Medium_venues = getNearbyVenues(names=NSachsen_Medium['Name'],
                                   latitudes=NSachsen_Medium['Lat'],
                                   longitudes=NSachsen_Medium['Long'],
                                   populations=NSachsen_Medium['Population']             
                                  )
Adelebsen
Adendorf
Aerzen
Ahlerstedt
Alfeld (Leine)
Algermissen
Ankum
Apen
Auetal
Bad Bentheim
Bad Bevensen
Bad Essen
Bad Fallingbostel
Bad Gandersheim
Bad Grund (Harz)
Bad Iburg
Bad Laer
Bad Lauterberg im Harz
Bad Münder am Deister
Bad Nenndorf
Bad Pyrmont
Bad Rothenfelde
Bad Sachsa
Bad Salzdetfurth
Bakum
Bardowick
Barnstorf
Barßel
Bassum
Belm
Bersenbrück
Beverstedt
Bienenbüttel
Bispingen
Bissendorf
Bleckede
Bockenem
Bodenwerder
Bohmte
Bomlitz
Bösel
Bovenden
Brake (Unterweser)
Braunlage
Bremervörde
Bruchhausen-Vilsen
Bückeburg
Bunde
Butjadingen
Calberlah
Cappeln
Clausthal-Zellerfeld
Coppenbrügge
Cremlingen
Dannenberg
Delligsen
Diekholzen
Diepholz
Dinklage
Dissen am Teutoburger Wald
Dörpen
Dörverden
Dötlingen
Drochtersen
Ebstorf
Edemissen
Elsfleth
Elze
Emlichheim
Emmerthal
Emsbüren
Emstek
Eschede
Esens
Essen/Oldenburg
Esterwegen
Faßberg
Fredenbeck
Freren
Fürstenau
Geeste
Glandorf
Gleichen
Gnarrenburg
Goldenstedt
Grasberg
Großefehn
Großenkneten
Großheide
Hage
Hagen am Teutoburger Wald
Hagen im Bremischen
Hambergen
Hambühren
Hanstedt
Hardegsen
Harsefeld
Harsum
Hasbergen
Haselünne
Hemmingen
Hemmoor
Herzberg am Harz
Hessisch Oldendorf
Hilter am Teutoburger Wald
Himmelpforten
Hinte
Hohenhameln
Holdorf
Holle
Holzminden
Horneburg
Ihlow
Isenbüttel
Jesteburg
Jever
Kalefeld
Katlenburg-Lindau
Kirchlinteln
Königslutter am Elm
Krummhörn
Lachendorf
Lamspringe
Langelsheim
Lathen
Lemwerder
Lengede
Liebenburg
Lilienthal
Löningen
Loxstedt
Meine
Meinersen
Molbergen
Moringen
Neuenhaus
Neuenkirchen-Vörden
Norderney
Nordstemmen
Nörten-Hardenberg
Obernkirchen
Ostercappeln
Ostrhauderfehn
Otterndorf
Ottersberg
Ovelgönne
Oyten
Pattensen
Quakenbrück
Rehburg-Loccum
Reppenstedt
Rhauderfehn
Ritterhude
Rodenberg
Rosengarten
Rühen
Salzbergen
Salzhemmendorf
Sarstedt
Sassenburg
Saterland
Scheeßel
Schellerten
Schiffdorf
Schladen-Werla
Schneverdingen
Schöningen
Schöppenstedt
Schüttorf
Schwarmstedt
Schwülper
Seesen
Sibbesse
Sickte
Sittensen
Sögel
Söhlde
Sottrum
Spelle
Stadland
Stadtoldendorf
Steinfeld (Oldenburg)
Stelle
Steyerberg
Stolzenau
Südbrookmerland
Südheide
Sulingen
Thedinghausen
Tostedt
Twist
Twistringen
Uelsen
Uplengen
Uslar
Vechelde
Visbek
Visselhövede
Wagenfeld
Wangerland
Wardenburg
Wathlingen
Weener
Wendeburg
Wennigsen
Werlte
Wesendorf
Wiefelstede
Wiesmoor
Wietmarschen
Wietze
Wittingen
Worpswede
Wrestedt
Wurster Nordseeküste
Zetel
Zeven

Let's check the size of the resulting dataframe

In [469]:
print(NSachsen_Medium_venues.shape)
NSachsen_Medium_venues.head()
(1398, 8)
Out[469]:
City City Latitude City Longitude City Population Venue Venue Latitude Venue Longitude Venue Category
0 Adelebsen 51.579484 9.752448 6245.0 REWE 51.578532 9.759187 Liquor Store
1 Adelebsen 51.579484 9.752448 6245.0 Elektro Bitzer 51.578961 9.752279 Electronics Store
2 Adelebsen 51.579484 9.752448 6245.0 Bahnhof Adelebsen 51.577796 9.759529 Train Station
3 Adelebsen 51.579484 9.752448 6245.0 Edeka Adelebsen 51.586090 9.758766 Supermarket
4 Adelebsen 51.579484 9.752448 6245.0 Bahnhof Lödingsen 51.589336 9.790792 Train Station

Thus, we learned about 1398 Venue. We now know their locations, the cities they are in, the locations of these cities and the populations of these cities. This information constitutes the basis of our work. In this study, a conclusion will be made based on the information.

!!! We have 1398 Venues in all middle cities in Niedersachsen (based on our criterias: in 3 km of the points )


Let's see Venues on Map

These venues have been transferred to the map. It seems that all of them are distributed throughout Niedersachsen.These are investments that are open to the daily use of the people in medium-sized settlements in the German state of Niedersachsen.

In [414]:
venues = folium.Map(location=[52.0, 9.0], zoom_start=7)

for lat,lng,city in zip(NSachsen_Medium_venues['Venue Latitude'], NSachsen_Medium_venues['Venue Longitude'],NSachsen_Medium_venues['Venue']):
    folium.CircleMarker(
    [lat,lng],
    radius=0.3,
    color='red',
    parse_html=False).add_to(venues)
venues
Out[414]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Methodology

In this project we will direct our efforts on detecting rural areas of Niedersachsen that have low shop density, particularly those with medium number of population. We will limit our analysis to area ~3km around city center.

In first step we have collected the required data: location and type (category) of every shops within 3km from each city center(according to Foursquare categorisation). We have also Population Data for each City.

Second step in our analysis will be calculation and exploration of 'shop density' across different areas of Niedersachsen - we will use Scatter Plots and Histograms to identify a few promising areas close to center with high number of shops in general.

In third and final step we will focus on most promising areas and within those create clusters of locations that meet some basic requirements established in discussion with stakeholders: we will take into consideration locations with diffirent kinds of Venue Category, and we want locations with high population. We will present map of all such locations but also create clusters (using k-means clustering) of those locations to identify zones / cities / addresses and search for optimal venue locations by stakeholders.

Analysis

Let's perform some basic explanatory data analysis and derive some additional info from our raw data. First let's count the number of Venues in every City:

The Count of All Venues for each Middle-City in Niedersachsen

In [612]:
NSachsen_Medium_venues_count = NSachsen_Medium_venues[['City']].groupby(['City'])['City'] \
                                                         .count() \
                                                         .reset_index(name='count') \
                                                         .sort_values(['count'], ascending=False) \
                                                         
print(NSachsen_Medium_venues_count.shape)
NSachsen_Medium_venues_count.head(25)
(217, 2)
Out[612]:
City count
100 Hemmingen 42
1 Adendorf 25
33 Bispingen 23
42 Braunlage 22
125 Lemwerder 21
9 Bad Bentheim 18
19 Bad Nenndorf 16
48 Bückeburg 16
4 Alfeld (Leine) 14
25 Bardowick 14
115 Jever 14
20 Bad Pyrmont 13
21 Bad Rothenfelde 13
12 Bad Fallingbostel 13
40 Bovenden 13
110 Holzminden 12
150 Reppenstedt 12
146 Oyten 12
158 Sarstedt 12
212 Worpswede 12
128 Lilienthal 11
216 Zeven 11
120 Königslutter am Elm 11
51 Clausthal-Zellerfeld 11
103 Hessisch Oldendorf 10

We now know the Venue numbers for each city. For example, there are 42 Venue in Hemmingen. Hemmingen is a highly populated city. Therefore, the number of Venue is also very high. I wonder if this applies to all cities? So, does a city's population increase in proportion to its Venue number? The best way to find the answer to this question is to compare the population of each city with the number of Venue. It is convenient to do this comparison with line plot. For this, of course, applying normalization to the population column will give a better result. Let's see the relationship between Population and Count.

In [530]:
NSachsen_Medium_venues_count1=NSachsen_Medium_venues_count.reset_index()
NSachsen_Medium_venues_count2=NSachsen_Medium_venues_count1.join(NSachsen_Medium.set_index('Name'), on='City')

from scipy.stats import zscore
NSachsen_Medium_venues_count2['Population']=NSachsen_Medium_venues_count2[['Population']].apply(zscore)

NSachsen_Medium_venues_count2[['count','Population']].plot(kind='line', figsize=(14, 8))
Out[530]:
<matplotlib.axes._subplots.AxesSubplot at 0x2333762bd08>

It appears that there is no connection in the Line Plot. In other words, with the decrease in the number of Venue, the population number does not decrease regularly. Based on the information we have, we cannot establish a relationship between the population and the Venue Count. An important reason for this is that, in my opinion, the information received from Foursquare is not at a perfect level. Normally, as the population increases, the number of Venue there should also increase. Another reason may be the accuracy of our locations, that is, our latitude and longitude information. The latitude and longitude information specified in the information in the hand does not always indicate the center of the city in terms of population density. Latitude Longitude information sometimes shows the geographical center of the city. Therefore, this point can sometimes be a location in the middle of agricultural lands. Therefore, information received from Foursquare may not be secure enough. However, we will ignore this situation in this study. Because our primary goal is to follow a certain methodology. We will see the result together.

Histogram :Counf of Venues

We will use the histogram plot to analyze the Venue numbers more closely.

In [532]:
count, bin_edges = np.histogram(NSachsen_Medium_venues_count['count'],4)
In [533]:
NSachsen_Medium_venues_count['count'].plot(kind='hist', figsize=(16, 5), color=['red'], bins=4)
plt.title('Niedersachsen Medium Cities Venues')  
plt.ylabel('Number of Cities')
plt.xlabel('Count of Venues') 
plt.show()

As seen in the histogram plot, there are less than 10 venues in most cities. The biggest reason for this is that the cities we studied are generally settlements with small population density. Our aim is to invest in these regions. Therefore, the obtained Venue number information does not logically completely conflict with the population information. However, it should be kept in mind that there is no linear regression between the population and Venue numbers.

Let's have a new Dataframe that shows the Venue Counts and Population together

May be we can cluster Cities based on Population or Venue Counts. We know that there is no linear relationship between population and Venue Count. This does not mean that there is no link between them. So we will try to make a grouping based on these two pieces of information. Our aim is to see if a meaningful photo will come out.

In [447]:
NSachsen_Medium_venues_count_Pop=NSachsen_Medium_venues_count.join(NSachsen_Medium.set_index('Name'), on='City')
In [534]:
print(NSachsen_Medium_venues_count_Pop.shape) # add population
NSachsen_Medium_venues_count_Pop.head()
(217, 6)
Out[534]:
City count Lat Long County Population
100 Hemmingen 42 52.316700 9.750000 Region Hannover 18974.0
1 Adendorf 25 53.281748 10.439299 Lüneburg 10853.0
33 Bispingen 23 53.082303 9.996472 Heidekreis 6410.0
42 Braunlage 22 51.726439 10.610052 Goslar 5795.0
125 Lemwerder 21 53.161712 8.608151 Wesermarsch 7122.0

And now, let's have 4 type of City "Based on Venue Count and Populations"

In [535]:
clustering_totalVenuePops=NSachsen_Medium_venues_count_Pop.drop(columns=["City","Lat","Long","County"])
kclusters_totalVenuePops = 4
kmeans_totalVenuePops = KMeans(n_clusters=kclusters_totalVenuePops, random_state=0).fit(clustering_totalVenuePops)
NSachsen_Medium_venues_count_Pop.insert(0, 'Cluster Labels_TotalVenueCount', kmeans_totalVenueCounts.labels_)
In [536]:
NSachsen_Medium_venues_count_Pop.head()
Out[536]:
Cluster Labels_TotalVenueCount City count Lat Long County Population
100 2 Hemmingen 42 52.316700 9.750000 Region Hannover 18974.0
1 1 Adendorf 25 53.281748 10.439299 Lüneburg 10853.0
33 1 Bispingen 23 53.082303 9.996472 Heidekreis 6410.0
42 1 Braunlage 22 51.726439 10.610052 Goslar 5795.0
125 1 Lemwerder 21 53.161712 8.608151 Wesermarsch 7122.0
In [537]:
count, bin_edges = np.histogram(NSachsen_Medium_venues_count_Pop['Cluster Labels_TotalVenueCount'],bins=4)
NSachsen_Medium_venues_count_Pop['Cluster Labels_TotalVenueCount'].plot(kind='hist', figsize=(16, 5),bins=4)
plt.title('Groups based on Population and Counts') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Groups') # add x-label
plt.show()

As seen in the histogram, no significant grouping occurred. Therefore, the population information we have is not suitable to be used in the classification of cities in Niedersachsen. At least we say this based on the information we have. Whether this reflects the truth is not clear. To understand this, we must firmly trust the accuracy of the Venue number in each city. Therefore, in our study, we operate with the information we have. Given the information we have, population information is not useful in classifying cities in terms of Venue. Let's confirm this once with a scatter plot.

Let's see Populations and Venue Counts together

In [290]:
NSachsen_Medium_venues_count_Pop.plot(kind='scatter', x='Population', y='count', color='darkblue', figsize=(10,6))
Out[290]:
<matplotlib.axes._subplots.AxesSubplot at 0x2331f565508>

The situation that appears in the Scatter Plot also supports our conclusion. In other words, the distinguishing features of cities are not the number of Venues and population information we have. So we have to follow another method in order to determine the distinctive features of cities. At this point, we will take a look at the category information of the Venues we have.


Let's now explore the Categories of Venues

In [133]:
print('There are {} uniques categories.'.format(len(NSachsen_Medium_venues['Venue Category'].unique())))
There are 184 uniques categories.
In [557]:
categories=NSachsen_Medium_venues['Venue Category'].unique()
categories[40:60]
Out[557]:
array(['Plaza', 'Hot Spring', 'Greek Restaurant', 'Sporting Goods Shop',
       'Discount Store', 'Garden Center', 'Outlet Store',
       'Trattoria/Osteria', 'Chinese Restaurant', 'Bed & Breakfast',
       'Gastropub', 'Tea Room', 'French Restaurant', 'Grocery Store',
       'Gym / Fitness Center', 'ATM', 'Beer Store', 'BBQ Joint',
       'Fried Chicken Joint', 'Convention Center'], dtype=object)

We have around 1400 venue information for more than 200 cities in Niedersachsen. We have around 1400 venue information for more than 200 cities in Niedersachsen. We examine the category information of these 1400 venues. The category type of the venues that we have knowledge of is 184. So there are 184 categories. Drugstore, Nightclub, Pet store, Historic Site, Bakery, French Restaurant and Tea Room are some of them.

We see here Categories of all 1398 Venues

Let's create a new dataframe to see how much Venue in which city and category.

In [558]:
#one hot encoding
NSachsen_onehot = pd.get_dummies(NSachsen_Medium_venues[['Venue Category']], prefix="", prefix_sep="")
#Germany_onehot.drop(['City'],axis=1,inplace=True) 
NSachsen_onehot.insert(loc=0, column='City Population', value=NSachsen_Medium_venues['City Population'])
NSachsen_onehot.insert(loc=0, column='City Longitude', value=NSachsen_Medium_venues['City Longitude'])
NSachsen_onehot.insert(loc=0, column='City Latitude', value=NSachsen_Medium_venues['City Latitude'])
NSachsen_onehot.insert(loc=0, column='City', value=NSachsen_Medium_venues['City'])

NSachsen_onehot.head(10)
Out[558]:
City City Latitude City Longitude City Population ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 Adelebsen 51.579484 9.752448 6245.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Adelebsen 51.579484 9.752448 6245.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Adelebsen 51.579484 9.752448 6245.0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
3 Adelebsen 51.579484 9.752448 6245.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 Adelebsen 51.579484 9.752448 6245.0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
5 Adendorf 53.281748 10.439299 10853.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 Adendorf 53.281748 10.439299 10853.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 Adendorf 53.281748 10.439299 10853.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 Adendorf 53.281748 10.439299 10853.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 Adendorf 53.281748 10.439299 10853.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 188 columns

In [353]:
NSachsen_onehot.shape
Out[353]:
(1398, 188)

Let's see Cities and their Venue Counts based on their Categories

We will use it also for clustering

In [560]:
city_venues_count=NSachsen_onehot.groupby(['City']).sum().reset_index()
city_venues_count=city_venues_count.drop(columns=["City Latitude","City Longitude","City Population"])
city_venues_count.head()
Out[560]:
City ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery Arts & Crafts Store Asian Restaurant Athletics & Sports ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 Adelebsen 0 0 0 0 0 0 0 0 0 ... 0 2 0 0 0 0 0 0 0 0
1 Adendorf 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Aerzen 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Ahlerstedt 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 Alfeld (Leine) 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 185 columns

Yes, in this dataframe, we see how many of which kind of venue there are in which city. For example, there are 2 train stations in Adelebsen. But there is no Asian Restaurant there. There is a train station in Bad Fallingbostel. Yes I have been at Bad fallinbostel for a while and I personally confirm this information.

However, we noticed that the Venue count information was not satisfactory enough as a distinguishing feature in the previous operations. So we need new and more useful information here. This information is the venue rates. Which venue is there at what rate? Accordingly, the general characteristics of the cities can be reached. And a grouping can be made based on this information. We will make our investment decisions based on this grouping.

Let's see Cities and their mean Values for each Venue Category

We will also use it for clustering

In [141]:
NSachsen_grouped = NSachsen_onehot.groupby('City').mean().reset_index()
NSachsen_grouped.head()
Out[141]:
City City Latitude City Longitude City Population ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 Adelebsen 51.579484 9.752448 6245.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Adendorf 53.281748 10.439299 10853.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Aerzen 52.049607 9.263816 10524.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Ahlerstedt 53.406983 9.452321 5451.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Alfeld (Leine) 51.986308 9.824747 18535.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 188 columns

In [571]:
NSachsen_Medium_venues[NSachsen_Medium_venues['City']=='Adelebsen'][['Venue','Venue Category']]
Out[571]:
Venue Venue Category
0 REWE Liquor Store
1 Elektro Bitzer Electronics Store
2 Bahnhof Adelebsen Train Station
3 Edeka Adelebsen Supermarket
4 Bahnhof Lödingsen Train Station

Looking at this new table, we see that the ratio of Train stations in Adelebsen to other venues in Adelebsen is 40%. In a city with two train stations, it is expected that there will be a lot of venues. However, we know that this is not the case here in reality. Normally, there are many venues in Adelebsen. However, Foursquare informs us that this venue is. According to the information we have, there are a total of 5 venues in the city, but in reality there are at least 30. However, we base our work on Foursquare information. And we will come to a conclusion accordingly. Therefore, we are trying to apply the right method. However, it seems difficult for us to come to an exact grouping conclusion. As a result, we will continue our analysis with a Kmeans algorithm based on category information.

Clustering

We will make two different groupings based on two different dataframe. The first is the datframe that provides the rate information of the venues. The other dataframe gives the information about how many venues there are in each city. First of all, we will examine the rate information.

clustering_1

We have a dataframe showing the rate data of the venues.

In [578]:
NSachsen_grouped.head()
Out[578]:
City City Latitude City Longitude City Population ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 Adelebsen 51.579484 9.752448 6245.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Adendorf 53.281748 10.439299 10853.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Aerzen 52.049607 9.263816 10524.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Ahlerstedt 53.406983 9.452321 5451.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Alfeld (Leine) 51.986308 9.824747 18535.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 188 columns

In [579]:
NSachsen_grouped_clustering_1=NSachsen_grouped.drop(columns=["City","City Latitude","City Longitude","City Population"])
NSachsen_grouped_clustering_1.head(1)
Out[579]:
ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery Arts & Crafts Store Asian Restaurant Athletics & Sports Auto Dealership ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 rows × 184 columns

In [580]:
# set number of clusters
kclusters_1 = 3

# run k-means clustering
kmeans_1 = KMeans(n_clusters=kclusters_1, random_state=0).fit(NSachsen_grouped_clustering_1)
In [581]:
# check cluster labels generated for each row in the dataframe
kmeans_1.labels_
Out[581]:
array([0, 0, 2, 2, 0, 0, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 1,
       2, 2, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 1,
       2, 1, 0, 1, 0, 0, 0, 2, 0, 1, 0, 1, 2, 0, 2, 1, 1, 2, 0, 2, 0, 0,
       1, 1, 1, 1, 1, 2, 0, 1, 2, 1, 1, 2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 2, 1, 0, 1, 2, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0,
       2, 0, 0, 1, 2, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 2, 1, 1, 2, 0, 1, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       2, 2, 1, 0, 2, 1, 0, 0, 1, 1, 0, 2, 2, 1, 0, 1, 0, 2, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 2, 2,
       1, 1, 1, 0, 0, 2, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 2, 1, 2])

Based on the ratio data, we obtained 3 different sets. Now let's make a grouping for each city based on the venue count data.

clustering_2

In [582]:
NSachsen_grouped_clustering_2=city_venues_count.drop(columns=["City"])
NSachsen_grouped_clustering_2.head(1)
Out[582]:
ATM Airport Alternative Healer American Restaurant Apres Ski Bar Art Gallery Arts & Crafts Store Asian Restaurant Athletics & Sports Auto Dealership ... Trail Train Station Tram Station Trattoria/Osteria Truck Stop Water Park Waterfront Wine Bar Yoga Studio Zoo
0 0 0 0 0 0 0 0 0 0 0 ... 0 2 0 0 0 0 0 0 0 0

1 rows × 184 columns

In [584]:
# set number of clusters
kclusters_2 = 3

# run k-means clustering
kmeans_2 = KMeans(n_clusters=kclusters_2, random_state=0).fit(NSachsen_grouped_clustering_2)
In [585]:
# check cluster labels generated for each row in the dataframe
kmeans_2.labels_
Out[585]:
array([2, 1, 2, 2, 1, 2, 1, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 1, 2, 0, 0, 1,
       0, 0, 2, 1, 2, 1, 2, 1, 0, 1, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2, 0, 1,
       2, 1, 2, 2, 1, 2, 2, 0, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2,
       1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2,
       1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 0, 2, 1, 1, 1, 2, 1, 2, 2, 2,
       0, 2, 2, 1, 2, 1, 2, 2, 1, 1, 0, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1,
       1, 2, 1, 1, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2,
       0, 2, 1, 2, 0, 2, 2, 2, 2, 1, 2, 2, 0, 1, 2, 1, 2, 2, 2, 2, 1, 1,
       1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2,
       1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1])

Based on the count data, we obtained also 3 different sets.

most common venues

Now let's list the popular venues to give us a clue about the cities.

In [643]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[4:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    for i in range(num_top_venues):
        if row_categories_sorted[i] == 0:
            row_categories_sorted.index.values[i]=''
    
    return row_categories_sorted.index.values[0:num_top_venues]
In [654]:
num_top_venues = 30

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City','Lat','Long','Population']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
        

# create a new dataframe
cities_venues_sorted = pd.DataFrame(columns=columns)
cities_venues_sorted['City'] = NSachsen_grouped['City']
cities_venues_sorted['Lat'] = NSachsen_grouped['City Latitude']
cities_venues_sorted['Long'] = NSachsen_grouped['City Longitude']
cities_venues_sorted['Population'] = NSachsen_grouped['City Population']

for ind in np.arange(NSachsen_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 4:] = return_most_common_venues(NSachsen_grouped.iloc[ind, :], num_top_venues)

cities_venues_sorted.head()
Out[654]:
City Lat Long Population 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue ... 21th Most Common Venue 22th Most Common Venue 23th Most Common Venue 24th Most Common Venue 25th Most Common Venue 26th Most Common Venue 27th Most Common Venue 28th Most Common Venue 29th Most Common Venue 30th Most Common Venue
0 Adelebsen 51.579484 9.752448 6245.0 Train Station Liquor Store Electronics Store Supermarket ...
1 Adendorf 53.281748 10.439299 10853.0 Supermarket Shopping Mall Gym Hardware Store Golf Course Furniture / Home Store ...
2 Aerzen 52.049607 9.263816 10524.0 Hotel Golf Course Italian Restaurant Liquor Store ...
3 Ahlerstedt 53.406983 9.452321 5451.0 Supermarket Construction & Landscaping Hotel Gas Station ...
4 Alfeld (Leine) 51.986308 9.824747 18535.0 Supermarket Bakery Italian Restaurant Fast Food Restaurant Big Box Store Eastern European Restaurant ...

5 rows × 34 columns

The popular venue information for each city appears in the list. The Venues of only 25 cities have more than 10 catogories. Most of them have less than 10 categories. So this Dataset is good for understanding the characteristic features of cities. However, this is not very important as our grouping process is not based on this dataset. We did not use this table for clustering. We put it here just to see the general situation.

Add Both of 2 Clustering Labels to Most Common Venues

Now let's add the group informations we obtained based on the category information to this table.

In [656]:
cities_venues_sorted.insert(0, 'Cluster Labels', kmeans_1.labels_)
In [657]:
cities_venues_sorted.insert(0, 'Cluster Labels(count)', kmeans_2.labels_)
In [658]:
cities_venues_sorted.head()
Out[658]:
Cluster Labels(count) Cluster Labels City Lat Long Population 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue ... 21th Most Common Venue 22th Most Common Venue 23th Most Common Venue 24th Most Common Venue 25th Most Common Venue 26th Most Common Venue 27th Most Common Venue 28th Most Common Venue 29th Most Common Venue 30th Most Common Venue
0 2 0 Adelebsen 51.579484 9.752448 6245.0 Train Station Liquor Store Electronics Store Supermarket ...
1 1 0 Adendorf 53.281748 10.439299 10853.0 Supermarket Shopping Mall Gym Hardware Store ...
2 2 2 Aerzen 52.049607 9.263816 10524.0 Hotel Golf Course Italian Restaurant Liquor Store ...
3 2 2 Ahlerstedt 53.406983 9.452321 5451.0 Supermarket Construction & Landscaping Hotel Gas Station ...
4 1 0 Alfeld (Leine) 51.986308 9.824747 18535.0 Supermarket Bakery Italian Restaurant Fast Food Restaurant ...

5 rows × 36 columns

Histogram of Clusters

In [597]:
count, bin_edges = np.histogram(cities_venues_sorted['Cluster Labels'], bins=3)
print(count) # frequency count 
print(bin_edges) # bin ranges, default = 10 bins
[99 68 50]
[0.         0.66666667 1.33333333 2.        ]
In [598]:
cities_venues_sorted['Cluster Labels'].plot(kind='hist', figsize=(8, 5), bins=3)
plt.title('Niedersachsen Medium Cities Cluster Labels') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Cluster Labels') # add x-label
plt.show()

As a result of clustering based on venue ratios, groups of 99, 65 and 50 cities have been formed.

In [599]:
count, bin_edges = np.histogram(cities_venues_sorted['Cluster Labels(count)'], bins=3)
print(count) # frequency count 
print(bin_edges) # bin ranges, default = 10 bins
[ 18  83 116]
[0.         0.66666667 1.33333333 2.        ]
In [600]:
cities_venues_sorted['Cluster Labels(count)'].plot(kind='hist', figsize=(8, 5), bins=3)
plt.title('Niedersachsen Medium Cities Cluster Labels(count)') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Cluster Labels(count)') # add x-label
plt.show()

As a result of clustering based on venue counts, groups of 116, 83 and 18 cities have been formed.

When we examine both histograms, we see that they are actually very similar. In the first histogram, the first group corresponds to the last group in the second histogram. However, as stated before, we know that the count information is not determinative based on the quality of the information we have. So we have to go to the result according to the rate information. Therefore, we will make an evaluation based on the first grouping method.



And now let's see, the Relationship between Both "2 Clusters" and "Venue Counts and Populations of Venues

We are going to see it via 4 Scatter Plots

For that we need a new DataFrame like this: Cluster Labels / Cluster Labels(count) / population / count

In [614]:
Clusters_Populations_VenueCounts=cities_venues_sorted[['City','Lat', 'Long','Cluster Labels','Cluster Labels(count)','Population']]
Clusters_Populations_VenueCounts=Clusters_Populations_VenueCounts.join(NSachsen_Medium_venues_count.set_index('City'), on='City')
Clusters_Populations_VenueCounts.head()
Out[614]:
City Lat Long Cluster Labels Cluster Labels(count) Population count
0 Adelebsen 51.579484 9.752448 0 2 6245.0 5
1 Adendorf 53.281748 10.439299 0 1 10853.0 25
2 Aerzen 52.049607 9.263816 2 2 10524.0 4
3 Ahlerstedt 53.406983 9.452321 2 2 5451.0 4
4 Alfeld (Leine) 51.986308 9.824747 0 1 18535.0 14
In [615]:
Clusters_Populations_VenueCounts.plot(kind='scatter', x='Population', y='Cluster Labels', color='darkblue', figsize=(10,6))
Out[615]:
<matplotlib.axes._subplots.AxesSubplot at 0x23337833308>

There is no relationship between 'Cluster Labels' and 'Population'.

In [616]:
Clusters_Populations_VenueCounts.plot(kind='scatter', x='count', y='Cluster Labels', color='darkblue', figsize=(10,6))
Out[616]:
<matplotlib.axes._subplots.AxesSubplot at 0x23333f7e948>

There is no relationship between 'Cluster Labels' and 'Venue Counts'.

In [617]:
Clusters_Populations_VenueCounts.plot(kind='scatter', x='Population', y='Cluster Labels(count)', color='darkblue', figsize=(10,6))
Out[617]:
<matplotlib.axes._subplots.AxesSubplot at 0x23333f27e48>

There is no relationship between 'Cluster Labels(count)' and 'Population'.

In [618]:
Clusters_Populations_VenueCounts.plot(kind='scatter', x='count', y='Cluster Labels(count)', color='darkblue', figsize=(10,6))
Out[618]:
<matplotlib.axes._subplots.AxesSubplot at 0x23333e2c608>

There is no relationship between 'Cluster Labels(count)' and 'Venue Counts'.

As a result, there was no significant correlation between population and venue counts and grouping.

Now let's see the groups of cities on the map. Perhaps the geographical features of the cities may have influenced the grouping.

Clustered Map (All Clusters)

In [633]:
map_clusters_1 = folium.Map(location=[53.007, 9.043], zoom_start=7)

for lat, lon, poi, cluster in zip(cities_venues_sorted['Lat'], cities_venues_sorted['Long'], cities_venues_sorted['City'], cities_venues_sorted['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    if cluster==0:
        groupcolor='green'
    if cluster==1:
        groupcolor='blue'
    if cluster==2:
        groupcolor='red'
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=groupcolor,
        fill=True,
        fill_color=groupcolor,
        fill_opacity=0.7).add_to(map_clusters_1)
       
map_clusters_1
Out[633]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Let's see now our Clusters

Cluster 1

In [690]:
cluster_1_0=cities_venues_sorted.loc[cities_venues_sorted['Cluster Labels'] == 0, cities_venues_sorted.columns[[1] + list(range(2, cities_venues_sorted.shape[1]))]]
cluster_1_0.head()
Out[690]:
Cluster Labels City Lat Long Population 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue ... 21th Most Common Venue 22th Most Common Venue 23th Most Common Venue 24th Most Common Venue 25th Most Common Venue 26th Most Common Venue 27th Most Common Venue 28th Most Common Venue 29th Most Common Venue 30th Most Common Venue
0 0 Adelebsen 51.579484 9.752448 6245.0 Train Station Liquor Store Electronics Store Supermarket ...
1 0 Adendorf 53.281748 10.439299 10853.0 Supermarket Shopping Mall Gym Hardware Store Golf Course ...
4 0 Alfeld (Leine) 51.986308 9.824747 18535.0 Supermarket Bakery Italian Restaurant Fast Food Restaurant Big Box Store ...
5 0 Algermissen 52.251407 9.967904 7918.0 Construction & Landscaping Mobile Phone Shop Liquor Store Light Rail Station ...
7 0 Apen 53.221426 7.809734 11692.0 Construction & Landscaping Insurance Office Diner German Restaurant Furniture / Home Store ...

5 rows × 35 columns

Cluster 1 on Map

In [692]:
map_clusters_1_0 = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (map_clusters_1_0, cluster_1_0['Lat'], cluster_1_0['Long'],cluster_1_0['City'])
map_clusters_1_0
Out[692]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Cluster 2

In [681]:
cluster_1_1=cities_venues_sorted.loc[cities_venues_sorted['Cluster Labels'] == 1, cities_venues_sorted.columns[[1] + list(range(2, cities_venues_sorted.shape[1]))]]
cluster_1_1
Out[681]:
Cluster Labels City Lat Long Population 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue ... 21th Most Common Venue 22th Most Common Venue 23th Most Common Venue 24th Most Common Venue 25th Most Common Venue 26th Most Common Venue 27th Most Common Venue 28th Most Common Venue 29th Most Common Venue 30th Most Common Venue
21 1 Bad Rothenfelde 52.111020 8.161528 8470.0 Supermarket Hotel Ice Cream Shop Fast Food Restaurant Pool ...
27 1 Barßel 53.169999 7.743417 13039.0 Supermarket BBQ Joint Shoe Store ...
31 1 Beverstedt 53.434064 8.818337 13545.0 Supermarket Bakery ...
43 1 Bremervörde 53.485025 9.136209 18582.0 Supermarket Drugstore Fast Food Restaurant Electronics Store ...
45 1 Bunde 53.184017 7.269545 7675.0 Supermarket Border Crossing Ice Cream Shop Intersection Gas Station ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
206 1 Wesendorf 52.592945 10.534737 5378.0 Supermarket Hotel Photography Studio Rental Service ...
207 1 Wiefelstede 53.255863 8.115012 16162.0 Supermarket Fish Market Trail Bus Stop German Restaurant ...
208 1 Wiesmoor 53.411444 7.730944 13236.0 Supermarket Insurance Office Golf Course Business Service ...
213 1 Wrestedt 52.904303 10.584307 6366.0 Supermarket Boat or Ferry Train Station ...
215 1 Zetel 53.415921 7.970576 11813.0 Supermarket Drugstore Miscellaneous Shop ...

68 rows × 35 columns

Cluste 2 on Map

In [651]:
map_clusters_1_1 = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (map_clusters_1_1, cluster_1_1['Lat'], cluster_1_1['Long'],cluster_1_1['City'])
map_clusters_1_1  
Out[651]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Cluster 3

In [680]:
cluster_1_2=cities_venues_sorted.loc[cities_venues_sorted['Cluster Labels'] == 2, cities_venues_sorted.columns[[1] + list(range(2, cities_venues_sorted.shape[1]))]]
cluster_1_2
Out[680]:
Cluster Labels City Lat Long Population 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue ... 21th Most Common Venue 22th Most Common Venue 23th Most Common Venue 24th Most Common Venue 25th Most Common Venue 26th Most Common Venue 27th Most Common Venue 28th Most Common Venue 29th Most Common Venue 30th Most Common Venue
2 2 Aerzen 52.049607 9.263816 10524.0 Hotel Golf Course Italian Restaurant Liquor Store ...
3 2 Ahlerstedt 53.406983 9.452321 5451.0 Supermarket Construction & Landscaping Hotel Gas Station ...
6 2 Ankum 52.542303 7.868022 7568.0 Supermarket Hotel Liquor Store German Restaurant Park ...
9 2 Bad Bentheim 52.302479 7.160592 15609.0 Supermarket German Restaurant Hotel Italian Restaurant Steakhouse ...
10 2 Bad Bevensen 53.080941 10.574481 9248.0 Golf Course Hot Spring Hotel Greek Restaurant Liquor Store ...
12 2 Bad Fallingbostel 52.865448 9.693020 11926.0 Supermarket Hotel Fast Food Restaurant Trattoria/Osteria Train Station ...
14 2 Bad Grund (Harz) 51.785465 10.212901 8238.0 Historic Site Hotel Train Station Tea Room ...
16 2 Bad Laer 52.101414 8.086517 9110.0 Hotel Supermarket Gym / Fitness Center Greek Restaurant ...
17 2 Bad Lauterberg im Harz 51.618285 10.456052 10265.0 Supermarket Gastropub Water Park Hotel Train Station ...
19 2 Bad Nenndorf 52.336072 9.376960 11226.0 Supermarket Hotel Park Furniture / Home Store Bakery ...
20 2 Bad Pyrmont 51.985050 9.243464 19130.0 Supermarket Hotel Zoo Drugstore Café ...
22 2 Bad Sachsa 51.596363 10.553423 7372.0 Hotel Supermarket Gas Station Resort Train Station ...
23 2 Bad Salzdetfurth 52.059514 10.006473 13139.0 Train Station Hotel Supermarket Drugstore ...
26 2 Barnstorf 52.708931 8.505279 6465.0 Supermarket Soccer Field Hotel Train Station Restaurant ...
30 2 Bersenbrück 52.555329 7.947184 8618.0 Hotel Drugstore Electronics Store Supermarket ...
36 2 Bockenem 52.011148 10.133116 9855.0 Gas Station American Restaurant Hotel Miscellaneous Shop ...
38 2 Bohmte 52.366681 8.313543 12687.0 Supermarket Drugstore Miscellaneous Shop Hotel Train Station ...
42 2 Braunlage 51.726439 10.610052 5795.0 Hotel German Restaurant Skating Rink Supermarket Bistro ...
44 2 Bruchhausen-Vilsen 52.828865 8.988901 9118.0 Food Service Supermarket Pizza Place Hotel Market ...
51 2 Clausthal-Zellerfeld 51.807040 10.336448 15727.0 Hotel Supermarket Bistro Italian Restaurant Big Box Store ...
56 2 Diekholzen 52.094442 9.915534 6399.0 Supermarket Hotel Farm German Restaurant ...
58 2 Dinklage 52.663181 8.121843 13095.0 Hotel Supermarket Snack Place Ice Cream Shop ...
61 2 Dörpen 52.965332 7.331334 5392.0 Gas Station Hotel Train Station Canal Lock Supermarket ...
63 2 Dötlingen 52.935666 8.382064 6273.0 Hotel Trail Golf Course Italian Restaurant ...
71 2 Emstek 52.833961 8.154950 12254.0 Supermarket Hotel Farm Rental Car Location Greek Restaurant ...
74 2 Essen/Oldenburg 52.718009 7.940260 9078.0 Hotel Train Station German Restaurant Business Service Supermarket ...
77 2 Fredenbeck 53.524302 9.398497 6027.0 Golf Course Drugstore Supermarket ...
90 2 Hagen am Teutoburger Wald 52.200000 7.983330 13475.0 Drugstore Hotel Bus Stop Supermarket ...
94 2 Hanstedt 53.258349 10.021672 5759.0 Zoo Hotel Drugstore Supermarket Snack Place ...
99 2 Haselünne 52.671872 7.483054 13029.0 Clothing Store Truck Stop Drugstore Ice Cream Shop German Restaurant ...
106 2 Hinte 53.413087 7.189096 7106.0 Supermarket Gas Station Fast Food Restaurant Hotel Scenic Lookout ...
110 2 Holzminden 51.828835 9.446659 19841.0 Supermarket Hotel German Restaurant Big Box Store Fast Food Restaurant ...
114 2 Jesteburg 53.308527 9.957988 7969.0 Hotel Gas Station Liquor Store Café ...
120 2 Königslutter am Elm 52.249931 10.817169 15704.0 Hotel Convenience Store Train Station Drugstore Bakery ...
133 2 Molbergen 52.858659 7.921986 8944.0 Supermarket Hotel Breakfast Spot Resort ...
136 2 Neuenkirchen-Vörden 52.510921 8.090159 8868.0 Hotel Supermarket Fast Food Restaurant Truck Stop Train Station ...
139 2 Nörten-Hardenberg 51.630184 9.936601 8379.0 Hotel Train Station Fast Food Restaurant Intersection German Restaurant ...
143 2 Otterndorf 53.808132 8.899715 7311.0 Hotel Big Box Store Beach German Restaurant Lake ...
154 2 Rosengarten 53.398377 9.904898 13672.0 Hotel Rest Area Supermarket Sporting Goods Shop German Restaurant ...
155 2 Rühen 52.485929 10.885291 5798.0 Fast Food Restaurant Hotel Harbor / Marina Greek Restaurant ...
158 2 Sarstedt 52.234038 9.853425 19391.0 Electronics Store Hotel Drugstore Supermarket Mediterranean Restaurant ...
165 2 Schneverdingen 53.116597 9.792590 18750.0 Hotel Ice Cream Shop Grocery Store Train Station Trail ...
166 2 Schwarmstedt 52.677969 9.618078 5745.0 Hotel Drugstore Supermarket ...
171 2 Seesen 51.890557 10.170390 19240.0 Drugstore Supermarket Hotel Intersection ...
182 2 Stolzenau 52.510526 9.077340 7468.0 Supermarket Hotel Train Station Electronics Store ...
196 2 Visbek 52.833772 8.314850 9925.0 Rental Service Supermarket Miscellaneous Shop Hotel ...
197 2 Visselhövede 52.984760 9.578255 9598.0 Hotel Supermarket Train Station ...
203 2 Wendeburg 52.325491 10.386436 10412.0 Supermarket Truck Stop Hardware Store Hotel Snack Place ...
214 2 Wurster Nordseeküste 53.721616 8.592089 17095.0 Golf Course ...
216 2 Zeven 53.299291 9.280382 13822.0 Supermarket Hotel Sandwich Place Greek Restaurant Burger Joint ...

50 rows × 35 columns

Cluster 3 on Map

In [653]:
map_clusters_1_2 = folium.Map(location=[53.007, 9.043], zoom_start=7)
toSimpleMap (map_clusters_1_2, cluster_1_2['Lat'], cluster_1_2['Long'],cluster_1_2['City'])
map_clusters_1_2  
Out[653]:
Make this Notebook Trusted to load map: File -> Trust Notebook

There is no clear link or relationship between the geographic features and the grouping process on the maps.

Currently, we have 3 different groups. These groups are determined according to the venue distribution in the cities. The grouping process was made using the Kmeans algorithm over 184 different categories belonging to 217 cities. We will now examine these three different groups. We have to find the features that distinguish these groups from each other. In this way, the investor will get an idea of which type of investments to make in which group. To do this, we need to examine what kind of venues are in each city in the groups. For this, we need a new dataframe for each group. We will be able to see all types of categories in each city in these data frames. So we will see not just the top ten categories, but all of them. Then we will create wordcloud by converting them into text files. There will be one Wordcloud for each group. As a result, we will reveal group characteristics based on these wordclouds. The investor will determine the fields of entrepreneurship based on these.

While the wordclouds are being prepared, words that may give wrong ideas have been removed. For example, normally the supermarket is Venue that comes first in each group. However, corrections have been made in the texts in order to provide distinction between the Groups.

Cluster icindeki pupulasyon dagilimlari

In [792]:
clustersPie = pd.DataFrame({'clusters': [cluster_1_0.shape[0], cluster_1_1.shape[0], cluster_1_2.shape[0]]})
clustersPie.plot(kind='pie', figsize=(5, 6), autopct='%1.1f%%', subplots=True)
plt.title('Clusters') 
Out[792]:
Text(0.5, 1.0, 'Clusters')
In [791]:
clustersPie = pd.DataFrame({'clusters': [cluster_1_0.Population.mean(), cluster_1_1.Population.mean(), cluster_1_2.Population.mean()]})
clustersPie.plot(kind='pie', figsize=(5, 6), autopct='%1.1f%%', subplots=True)
plt.title('Populations(mean)') 
Out[791]:
Text(0.5, 1.0, 'Populations(mean)')
In [790]:
cluster_1_0['Population'].plot(kind='hist', figsize=(8, 10), bins=10)
plt.title('Cluster 1') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Population') # add x-label
plt.show()
In [793]:
cluster_1_1['Population'].plot(kind='hist', figsize=(8, 5), bins=10)
plt.title('Cluster 2') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Population') # add x-label
plt.show()
In [794]:
cluster_1_2['Population'].plot(kind='hist', figsize=(8, 5), bins=10)
plt.title('Cluster 3') # add a title to the histogram 
plt.ylabel('Number of Cities') # add y-label 
plt.xlabel('Population') # add x-label
plt.show()

It is again seen that there is no significant relationship between cluster distribution and population above. So the only element that determines the groups is the Venue distribution.

Cluster-1 Wordcloud

In [671]:
#!conda install -c conda-forge wordcloud==1.7.0 --yes

# import package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS
In [756]:
cluster1=cluster_1_0.drop(columns=['Cluster Labels','City','Lat','Long','Population'])
cluster1.to_csv('cluster1.txt', sep=',', index=False)
In [781]:
cluster1Text = open('cluster1.txt', 'r').read()
stopwords = set(STOPWORDS)

wordcloud1 = WordCloud(width=1600, 
                      height=800,
                      background_color='white',
                      max_words=3000,
                      relative_scaling=0.5,
                      stopwords=stopwords).generate(cluster1Text)

# Open a plot of the generated image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud1)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Store has become the most prominent venue of the group in cluster number 1. This becomes even more evident when considering that the ratio of Liquor Store should be added to this. Therefore, these cities generally have the feature of being settled. As it turns out, the Hotel couldn't find a place for itself here. This means that the cities in question are not actually very colorful. Restaurant and German Restaurant have similar rates in all groups. Therefore, these Venues also does not provide much opportunity for categorization. Therefore, investors should make new investments according to the needs of ordinary daily life in these cities.

Cluster-2 Wordcloud

In [686]:
cluster2=cluster_1_1.drop(columns=['Cluster Labels','City','Lat','Long','Population'])
cluster2.to_csv('cluster2.txt', sep=',', index=False)
In [782]:
cluster2Text = open('cluster2.txt', 'r').read()
stopwords = set(STOPWORDS)

wordcloud2 = WordCloud(width=800, 
                      height=400,
                      background_color='white',
                      max_words=3000,
                      relative_scaling=0.5,
                      stopwords=stopwords).generate(cluster2Text)

# Open a plot of the generated image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud2)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

2.Cluster is marked by Restaurant. Other eating and drinking places also support this. Therefore, it can be thought that social life and common activities are intense in these cities. Therefore, it is considered appropriate to make investments in the entertainment sector in these cities.

Cluster-3 Wordcloud

In [685]:
cluster3=cluster_1_2.drop(columns=['Cluster Labels','City','Lat','Long','Population'])
cluster3.to_csv('cluster3.txt', sep=',', index=False)
In [783]:
cluster3Text = open('cluster3.txt', 'r').read()
stopwords = set(STOPWORDS)

wordcloud3 = WordCloud(width=1600, 
                      height=800,
                      background_color='white',
                      max_words=3000,
                      relative_scaling=0.5,
                      stopwords=stopwords).generate(cluster3Text)

# Open a plot of the generated image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud3)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In the 3rd Cluster, the most prominent is Venue Hotel. It can be said that these cities are lively and changeable. It might be thought that new people outside of the city come here often. Therefore, businesses such as gift shops or those that will come from outside of the city will be the right choice. Shopping centers can be among the right choices for these cities. It is considered that restaurant-cafe style initiatives that reflect regional historical characteristics towards touristic historical areas will also be appropriate.

Results and Discussion

Our analysis shows that although there is a number of shops in Niedersachsen (~2000 in our initial area of interest which were middle scale cities).

It has been observed that the changes in the population of cities have no effect on categorization. Therefore, venue distribution in the cities belonging to each group has been tried to be analyzed. Based on this, the evaluations are given under the groups.

As a result: Venues in more than 200 cities have been studied. The population situation in these cities and the total number of venues in these cities have also been evaluated. However, it was concluded that the most important factor in classification is the variety of venues in cities. However, it is worth noting that: most of the cities we studied were cities of less than 10000 inhabitants. In such cities, people do not need applications such as Foursquare when determining the restaurant they will go to. Therefore, it should not be ignored that there is a weakness in terms of information richness.

Conclusion

The aim of this study was to study the medium-sized cities in the German state of Niedersachsen. Based on this analysis, it was aimed to see which investments could be made in which cities. The study was completed at a level where a certain evaluation can be made based on the information we have.

As a result: It is evaluated that investors can make a profit by taking into account the evaluations made about each group. It cannot be denied that we have some commercial knowledge of Niedersachsen with this study.